**CS2100 Computer Organisation**

**Tutorial #10: Pipelining**

**Answers**

***LumiNUS Discussion Questions***

D1. Suppose the four stages in some 4-stage pipeline take the following timing: 2ns, 3ns, 4ns, and 2ns. Given 1000 instructions, what is the speedup (in two decimal places) of the pipelined processor compared to the non-pipelined single-cycle processor?

***Answer:***

Non-pipelined: 11ns × 1000 = 11,000ns.

In a pipelined system, each stage takes 4ns.

Pipelined: 12ns + (1,000 × 4) ns = 12 ns + 4,000ns = 4,012ns.

Speedup = 11,000/4,012 = **2.74**

D2. Let’s try to understand pipeline processor by doing a detailed trace. Suppose the pipeline registers (also known as pipeline latches) store the following information*:*

|  |  |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **IF / ID** | |  | **ID / EX** | |  | **EX / MEM** | |  | **MEM / WB** | |
| **No Control Signal** | **---------** |  | **MToR** |  |  | **MToR** |  |  | **MToR** |  |
|  | **RegWr** |  |  | **RegWr** |  |  |
|  | **MemRd** |  |  | **MemRd** |  |  |
|  | **MemWr** |  |  |  |
|  | **Branch** |  |  | **MemWr** |  |  | **RegWr** |  |
|  | **RegDst** |  |  |  |
|  | **ALUsrc** |  |  | **Branch** |  |  |
|  | **ALUop** |  |  |  |
| **PC+4** |  |  | **PC+4** |  |  | **BrcTgt** |  |  | **MemRes** |  |
| **OpCode** |  |  |  | **isZero?** |  |  |
| **Rs** |  |  | **ALUOpr1** |  |  | **ALURes** |  |  | **ALURes** |  |
| **Rt** |  |  | **ALUOpr2** |  |  | **ALUOpr2** |  |  |
| **Rd** |  |  | **Rt** |  |  |  | **DstRNum** |  |
| **Funct** |  |  | **Rd** |  |  | **DstRNum** |  |  |
| **Imm(16)** |  |  | **Imm(32)** |  |  |  |

Show the progress of the following instructions through the pipeline stages by filling in the content of pipeline registers. Note that these are the same instructions from Tutorial #5 Question 1 so that you can reuse some of the answers here.

i. **0x8df80000 # lw $24, 0($15) #Inst.Addr = 0x100**

ii. **0x1023000C # beq $1, $3, 12 #Inst.Addr = 0x100**

iii. **0x0285c822 # sub $25, $20, $5 #Inst.Addr = 0x100**

Assume that registers 1 to 31 have been initialized to a value that is equal to 101 + its register number. i.e. [$1] = 102, [$31] = 132 etc. You can put “X” in fields that are irrelevant for that instruction. Do note that in reality, these fields are actually generated but not utilized.

Part (i) has been worked out for you.

i. **0x8df80000 # lw $24, 0($15) #Inst.Addr = 0x100**

|  |  |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **IF / ID** | |  | **ID / EX** | |  | **EX / MEM** | |  | **MEM / WB** | |
| **No Control Signal** | **---------** |  | **MToR** | **1** |  | **MToR** | **1** |  | **MToR** | **1** |
|  | **RegWr** | **1** |  | **RegWr** | **1** |  |
|  | **MemRd** | **1** |  | **MemRd** | **1** |  |
|  | **MemWr** | **0** |  |  |
|  | **Branch** | **0** |  | **MemWr** | **0** |  | **RegWr** | **1** |
|  | **RegDst** | **0** |  |  |
|  | **ALUsrc** | **1** |  | **Branch** | **0** |  |
|  | **ALUop** | **00** |  |  |
| **PC+4** | **0x104** |  | **PC+4** | **0x104** |  | **BrcTgt** | **X** |  | **MemRes** | **Mem(116)** |
| **OpCode** | **0x23** |  |  | **isZero?** | **X** |  |
| **Rs** | **$15** |  | **ALUOpr1** | **116** |  | **ALURes** | **116** |  | **ALURes** | **X** |
| **Rt** | **$24** |  | **ALUOpr2** | **X** |  | **ALUOpr2** | **X** |  |
| **Rd** | **X** |  | **Rt** | **$24** |  |  | **DstRNum** | **$24** |
| **Funct** | **X** |  | **Rd** | **X** |  | **DstRNum** | **$24** |  |
| **Imm(16)** | **0** |  | **Imm(32)** | **0** |  |  |

***Answers:***

ii. **0x1023000C # beq $1, $3, 12 #Inst.Addr = 0x100**

|  |  |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **IF / ID** | |  | **ID / EX** | |  | **EX / MEM** | |  | **MEM / WB** | |
| **No Control Signal** | **---------** |  | **MToR** | **X** |  | **MToR** | **X** |  | **MToR** | **X** |
|  | **RegWr** | **0** |  | **RegWr** | **0** |  |
|  | **MemRd** | **0** |  | **MemRd** | **0** |  |
|  | **MemWr** | **0** |  |  |
|  | **Branch** | **1** |  | **MemWr** | **0** |  | **RegWr** | **0** |
|  | **RegDst** | **X** |  |  |
|  | **ALUsrc** | **0** |  | **Branch** | **1** |  |
|  | **ALUop** | **01** |  |  |
| **PC+4** | **0x104** |  | **PC+4** | **0x104** |  | **BrcTgt** | **0x134** |  | **MemRes** | **X** |
| **OpCode** | **4** |  |  | **isZero?** | **0** |  |
| **Rs** | **$1** |  | **ALUOpr1** | **102** |  | **ALURes** | **-2** |  | **ALURes** | **X** |
| **Rt** | **$3** |  | **ALUOpr2** | **104** |  | **ALUOpr2** | **X** |  |
| **Rd** | **X** |  | **Rt** | **X** |  |  | **DstRNum** | **X** |
| **Funct** | **X** |  | **Rd** | **X** |  | **DstRNum** | **X** |  |
| **Imm(16)** | **12** |  | **Imm(32)** | **12** |  |  |

iii. **0x0285c822 # sub $25, $20, $5 #Inst.Addr = 0x100**

|  |  |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **IF / ID** | |  | **ID / EX** | |  | **EX / MEM** | |  | **MEM / WB** | |
| **No Control Signal** | **---------** |  | **MToR** | **0** |  | **MToR** | **0** |  | **MToR** | **0** |
|  | **RegWr** | **1** |  | **RegWr** | **1** |  |
|  | **MemRd** | **0** |  | **MemRd** | **0** |  |
|  | **MemWr** | **0** |  |  |
|  | **Branch** | **0** |  | **MemWr** | **0** |  | **RegWr** | **1** |
|  | **RegDst** | **1** |  |  |
|  | **ALUsrc** | **0** |  | **Branch** | **0** |  |
|  | **ALUop** | **10** |  |  |
| **PC+4** | **0x104** |  | **PC+4** | **0x104** |  | **BrcTgt** | **X** |  | **MemRes** | **X** |
| **OpCode** | **0** |  |  | **isZero?** | **X** |  |
| **Rs** | **$20** |  | **ALUOpr1** | **121** |  | **ALURes** | **15** |  | **ALURes** | **15** |
| **Rt** | **$5** |  | **ALUOpr2** | **106** |  | **ALUOpr2** | **X** |  |
| **Rd** | **$25** |  | **Rt** | **X** |  |  | **DstRNum** | **$25** |
| **Funct** | **22** |  | **Rd** | **$25** |  | **DstRNum** | **$25** |  |
| **Imm(16)** | **X** |  | **Imm(32)** | **X** |  |  |

D2. Given the following three formulas (See Lecture #20, Section 5 Performance):

For each of the following processor parameters, calculate *CTseq, CTpipeline* and *Speeduppipeline* (to two decimal places) for 10 instructions and for 10 million instructions.

|  |  |  |
| --- | --- | --- |
|  | Stages Timing (for 5 stages, in ps) | Latency of pipeline register (in ps) |
| a. | 300, 100, 200, 300, 100 (slow memory) | 0 |
| b. | 200, 200, 200, 200, 200 | 40 |
| c. | 200, 200, 200, 200, 200 (ideal) | 0 |

***Answers:***

|  |  |  |  |  |
| --- | --- | --- | --- | --- |
|  | **CTseq** | **CTpipeline** | **Speedup (10 inst)** | **Speedup (10m inst)** |
| a. | **1000ps** | **300ps** | (1000×10)/(300×14)  **= 2.38** | (1000×10m)/(300 (10m+4))  **= 3.33** |
| b. | **1000ps** | **240ps** | (1000×10)/(240×14)  **= 2.98** | (1000×10m)/(240×(10m+4))  **= 4.17** |
| c. | **1000ps** | **200ps** | (1000×10)/(200×14)  **= 3.57** | (1000×10m)/(200×(10m+4))  **= 5.00** |

***Tutorial Questions***

1.[AY2014/5 Semester 2 Exam]

Refer to the following MIPS program:

# register $s0 contains a 32-bit value

# register $s1 contains a non-zero 8-bit value

# at the right most (least significant) byte

**add $t0, $s0, $zero** #inst A

**add $s2, $zero, $zero** #inst B

lp: **bne $s2, $zero, done** #inst C

**beq $t0, $zero, done** #inst D

**andi $t1, $t0, 0xFF** #inst E

**bne $s1, $t1, nt** #inst F

**addi $s2, $s2, 1** #inst G

nt: **srl $t0, $t0, 8** #inst H

**j lp** #inst J

done:

We assume that the register **$s0** contains **0xAFAFFAFA** and **$s1** contains **0xFF**.

Given a 5-stage MIPS pipeline processor, for each of the parts below, give the total number of cycles needed for the first iteration of the execution from instructions **A** to **H** (i.e. excluding the “**j lp**” instruction). Remember to include the cycles needed for instruction **H** to finish the WB stage. Note that the questions are independent from each other.

* 1. With only data forwarding mechanisms and no control hazard mechanism.
  2. With data forwarding and “assume not taken” branch prediction. Note that there is no early branching.

[Recall that early branching means branch decision is made at stage 2 (Decode stage); no early branch means branch decision is made at stage 4 (Memory stage).]

* 1. By swapping two instructions (from Instructions **A** to **H**), we can improve the performance of **early branching (with all additional forwarding paths)**. Give the two instructions that can be swapped. You only need to indicate the instruction letters in your answer.

Give the total number of cycles needed for the execution of the whole code in the worst case for each of the following assumptions. You may assume that the jump instruction (**j**) computes the address of the instruction to jump to in the MEM stage.

* 1. With only data forwarding mechanisms and no control hazard mechanism.
  2. With data forwarding and “assume not taken” branch prediction. Note that there is no early branching.

***Answers:***

1. **20 cycles**

|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| add | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| add |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| bne |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |
| beq |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |
| andi |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |
| bne | The addi instruction is not executed. |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |
| ~~addi~~ |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| srl |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |

**(b) 14 cycles**

Cost of wrong prediction

|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| add | F | D | E | M | W |  |  |  |  |  |  |  |  |  |
| add |  | F | D | E | M | W |  |  |  |  |  |  |  |  |
| bne |  |  | F | D | E | M | W |  |  |  |  |  |  |  |
| beq |  |  |  | F | D | E | M | W |  |  |  |  |  |  |
| andi |  |  |  |  | F | D | E | M | W |  |  |  |  |  |
| bne |  |  |  |  |  | F | D | E | M | W |  |  |  |  |
| addi |  |  |  |  |  |  | F | D | E | \* | \* |  |  |  |
| srl |  |  |  |  |  |  |  | F | D | \* | \* | \* |  |  |
| j |  |  |  |  |  |  |  |  | F | \* | \* | \* | \* |  |
| srl |  |  |  |  |  |  |  |  |  | F | D | E | M | W |

**(c) Swap instructions A and B to reduce the delay between instructions B and C.**

**add $s2, $zero, $zero** #inst B

**add $t0, $s0, $zero** #inst A

lp: **bne $s2, $zero, done** #inst C

**beq $t0, $zero, done** #inst D

**andi $t1, $t0, 0xFF** #inst E

**bne $s1, $t1, nt** #inst F

**addi $s2, $s2, 1** #inst G

nt: **srl $t0, $t0, 8** #inst H

**j lp** #inst J

done:

**(d)**

|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
| add | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| add |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| bne |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| beq |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |
| andi |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |
| bne |  |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |
| ~~addi~~ |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| srl |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |
| j |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |
| bne |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F |

In the worst case, 4 bytes of the data are examined 🡪 4 iterations. The loop from Instructions C to J (one iteration) takes 18 cycles (not counting the WB stage of the j instruction which overlaps with the bne instruction). There are 2 cycles before the first iteration, and 9 cycles for Instructions C and D in the fifth iteration.

Therefore, total = 2 + (4×18) + 9 = **83 cycles**.

**(e)**

|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| add | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |
| add |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |
| bne |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |
| beq |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |
| andi |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |
| bne |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |
| addi |  |  |  |  |  |  | F | D | E | \* | \* |  |  |  |  |
| srl |  |  |  |  |  |  |  | F | D | \* | \* | \* |  |  |  |
| j |  |  |  |  |  |  |  |  | F | \* | \* | \* | \* |  |  |
| srl |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |
| j |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |
| bne |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F |

In the worst case, 4 bytes of the data are examined 🡪 4 iterations. The loop from Instructions C to J (one iteration) takes 12 cycles (not counting the WB stage of the j instruction which overlaps with the bne instruction). There are 2 cycles before the first iteration, and 6 cycles for Instructions C and D in the fifth iteration.

Therefore, total = 2 + (4×12) + 6 = **56 cycles**.

2.[AY2017/8 Semester 2 Exam]

Refer to the MIPS code below. *A* and *B* are integer arrays whose base addresses are in **$s0** and **$s1** respectively. The arrays are of the same size *n* (number of elements). **$s2** contains the value *n*. For this question, we will focus on the code from Instruction 1 onwards.

**.data**

**A: .word 11, 9, 31, 2, 9, 1, 6, 10**

**B: .word 3, 7, 2, 12, 11, 41, 19, 35**

**n: .word 8**

**.text**

***main:* la $s0, A # $s0 is the base address of array A**

**la $s1, B # $s1 is the base address of array B**

**la $t0, n # $t0 is the addr of n (size of array)**

**# $s2 is the content of n**

**beq $s2, $zero, *End*** # Inst1

**addi $t8, $s2, -1** # Inst2

**sll $t8, $t8, 2** # Inst3

***Loop:* add $t0, $s0, $t8** # Inst4

**add $t1, $s1, $t8** # Inst5

**lw $t2, 0($t0)** # Inst6

**lw $t3, 0($t1)** # Inst7

**andi $t4, $t3, 3** # Inst8

**addi $t4, $t4, -3** # Inst9

**beq $t4, $zero, *A1*** # Inst10

**add $t2, $t2, $t3** # Inst11

**j *A2*** # Inst12

***A1:* addi $t2, $t2, 1** # Inst13

***A2:* sw $t2, 0($t0)** # Inst14

**addi $t8, $t8, -8** # Inst15

**slt $t7, $t8, $zero** # Inst16

**beq $t7, $zero, *Loop*** # Inst17

***End:***

Assuming a 5-stage MIPS pipeline system with forwarding and early branching, that is, the branch decision is made at the ID stage. No branch prediction is made and no delayed branching is used. For the jump (**j**) instruction, the computation of the target address to jump to is done at the ID stage as well.

Assume also that the first **beq** instruction begins at cycle 1.

a. Suppose arrays *A* and *B* now each contains 200 positive integers. What is the minimum number and maximum number of instructions executed? (Consider only the above code segment from Inst1 to Inst17.)

b. List out the instructions where some stall cycle(s) are inserted in executing that instruction in the pipeline. These include delay caused by data dependency and control hazard. You may write the instruction number InstX instead of writing out the instruction in full.

c. How many cycles does one iteration of the loop (from Inst1 to Inst17) take if the **beq** instruction at Inst10 branches to *A1*? You have to count until the WB stage of Inst17.

d. How many cycles does one iteration of the loop (from Inst1 to Inst17) take if the **beq** instruction at Inst10 does not branch to *A1*? You have to count until the WB stage of Inst17.

***Answers:***

**The code does this:**

**int i;**

**for (i=n-1; i>=0; i-=2) {**

**if (B[i]%4 == 3)**

**A[i] = A[i] + 1;**

**else**

**A[i] = A[i] + B[i];**

**}**

1. **Minimum = 3 + 100 × 12 = 1203**

**Maximum = 3 + 100 × 13 = 1303**

In the loop (Inst4 to Inst17), there are two paths after Inst10: one that skips Inst11 and Inst12, and the other skips Inst13.

1. **Due to control: Inst2, Inst4, Inst11, Inst13**

**Due to data: Inst8, Inst10, Inst17**

1. **24 cycles (see page 10)**
2. **26 cycles (see page 11)**

Q2. (c)

|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| I1  beq | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I2  addi |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I3  sll |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I4  add |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I5  add |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I6  lw |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I7  lw |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I8  andi |  |  |  |  |  |  |  |  | F | D |  | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I9  addi |  |  |  |  |  |  |  |  |  | F | D |  | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I10  beq A1 |  |  |  |  |  |  |  |  |  |  | F |  |  | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I11  add |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I12  J A2 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I13  A1: addi |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |
| I14  A2:  sw |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |
| I15  addi |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |
| I16  slt |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |
| I17  beq |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F |  | D | E | M | W |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Q2. (d)

|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| I1  beq | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I2  addi |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I3  sll |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I4  add |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I5  add |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I6  lw |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I7  lw |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I8  andi |  |  |  |  |  |  |  |  | F | D |  | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I9  addi |  |  |  |  |  |  |  |  |  | F | D |  | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I10  beq A1 |  |  |  |  |  |  |  |  |  |  | F |  |  | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I11  add |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |  |
| I12  J A2 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |  |  |  |
| I13  A1: addi |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I14  A2:  sw |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F | D | W | M | W |  |  |  |  |  |  |  |  |
| I15  addi |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |  |
| I16  slt |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F | D | E | M | W |  |  |  |  |  |  |
| I17  beq |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | F |  | D | E | M | W |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

3.[AY2020/21 Semester 2 Exam]

Study the following MIPS code on integer arrays *A* and *B* which contain the same number of elements. $s0 and $s1 contain the base addresses of *A* and *B* respectively; $s2 is the number of elements in array *A*; $s5 is *count*.

**add $s5, $0, $0 # I1**

**add $t0, $0, $0 # I2**

**loop: slt $t8, $t0, $s2 # I3**

**beq $t8, $0, end # I4**

**sll $t1, $t0, 2 # I5**

**add $t3, $t1, $s0 # I6**

**lw $s3, 0($t3) # I7**

**andi $t9, $s3, 1 # I8**

**beq $t9, $0, skip # I9**

**add $t4, $t1, $s1 # I10**

**lw $s4, 0($t4) # I11**

**sub $s3, $s3, $s4 # I12**

**sw $s3, 0($t3) # I13**

**addi $s5, $s5, 1 # I14**

**skip: addi $t0, $t0, 1 # I15**

**j loop # I16**

**end:**

Assuming a 5-stage MIPS pipeline and all elements in array *A* are positive odd integers, answer the following questions. You need to count until the last stage of instruction I16.

(a) How many cycles does this code segment take to complete its execution in the first iteration (I1 to I16) in an ideal pipeline, that is, one with no delays?

For parts (b) to (d) below, given the assumption for each part, how many additional cycles does this code segment (I1 to I16) take to complete its execution in the first iteration as compared to an ideal pipeline? (For example, if part (a) takes 12 cycles and part (b) takes 20 cycles, you are to answer part (b) with the value 8 and not 20.)

(b) Assuming without forwarding and branch decision is made at MEM stage (stage 4). No branch prediction is made and no delayed branching is used.

(c) Assuming without forwarding and branch decision is made at ID stage (stage 2). No branch prediction is made and no delayed branching is used.

(d) Assuming with forwarding and branch decision is made at ID stage (stage 2). Branch prediction is made where the branch is predicted not taken, and no delayed branching is used.

(e) Assuming the setting in part (d) above and you are not allowed to modify any of the instructions, is it possible to reduce the additional delay cycles in part (d) by rearranging some instructions, and if possible, by how many cycles? Explain your answer. (Answer with no explanation will not be awarded any mark.)

***Answers:***

(a) 16 + 5 – 1 = **20** cycles.

Delays are highlighted under the columns (b), (c), (d) for parts (b),(c),(d) below respectively.

(b) **+24** cycles.

(c) **+20** cycles.

(d) **+4** cycles.

(b) (c) (d)

add $s5, $0, $0 # I1

add $t0, $0, $0 # I2

loop: slt $t8, $t0, $s2 # I3 +2 +2

beq $t8, $0, end # I4 +2 +2 +1

sll $t1, $t0, 2 # I5 +3 +1

add $t3, $t1, $s0 # I6 +2 +2

lw $s3, 0($t3) # I7 +2 +2

andi $t9, $s3, 1 # I8 +2 +2 +1

beq $t9, $0, skip # I9 +2 +2 +1

add $t4, $t1, $s1 # I10 +3 +1

lw $s4, 0($t4) # I11 +2 +2

sub $s3, $s3, $s4 # I12 +2 +2 +1

sw $s3, 0($t3) # I13 +2 +2

addi $s5, $s5, 1 # I14

skip: addi $t0, $t0, 1 # I15

j loop # I16

end:

**Total: +24 +20 +4**

(e) Other answers possible. Example:

Move I14 (addi $s5, $s5, 1) to between I11 (lw $s4, 0($t4)) and I12 (sub $s3, $s3, $s4) to remove the 1 cycle of delay at I12.